Data Visualization and Reproducible Research

Rene Perez.

The following is a sample of products created during the “Data Visualization and Reproducible Research” course.

Project 01

In this project, I explored the Billboard Summer Hits and Student Performance data sets. This page focuses on the Student Performance data: I wanted to find the relationship between exam score and other features, such as Netflix time and study hours, so I explored the data set by creating plots and fitting a multiple linear regression. Please find the code and report in the project_01/ folder.

Sample data visualization:

Student Habits performance

Linear Regression table

Dependent variable: exam_score

Term                                  Coefficient  (Std. Error)
age                                   -0.012       (0.074)
genderMale                            0.146        (0.348)
genderOther                           0.793        (0.866)
study_hours_per_day                   9.575***     (0.116)
social_media_hours                    -2.602***    (0.145)
netflix_hours                         -2.282***    (0.158)
part_time_jobYes                      0.211        (0.414)
attendance_percentage                 0.143***     (0.018)
sleep_hours                           1.992***     (0.138)
diet_qualityGood                      -0.683*      (0.378)
diet_qualityPoor                      -0.272       (0.473)
exercise_frequency                    1.450***     (0.084)
parental_education_levelHigh School   -0.160       (0.396)
parental_education_levelMaster        -0.411       (0.508)
parental_education_levelNone          -0.702       (0.633)
internet_qualityGood                  -0.473       (0.373)
internet_qualityPoor                  -0.082       (0.503)
mental_health_rating                  1.944***     (0.060)
extracurricular_participationYes      -0.014       (0.364)
Constant                              7.177***     (2.503)

Observations: 1,000
R²: 0.902
Adjusted R²: 0.900
Residual Std. Error: 5.342 (df = 980)
F Statistic: 473.908*** (df = 19; 980)
Note: *p<0.1; **p<0.05; ***p<0.01
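The significance markers in the table can be roughly reproduced from each coefficient and its standard error. The sketch below is illustrative Python, not the project's own code: it assumes the values in parentheses are standard errors, assumes the usual one/two/three-star convention for the 0.1, 0.05, and 0.01 thresholds, and uses a normal approximation to the t distribution (reasonable with 980 degrees of freedom). The helper names `two_sided_p` and `stars` are made up for this example.

```python
import math

# Coefficients and standard errors copied from three rows of the table.
terms = {
    "age": (-0.012, 0.074),
    "study_hours_per_day": (9.575, 0.116),
    "diet_qualityGood": (-0.683, 0.378),
}


def two_sided_p(coef, se):
    """Two-sided p-value for coef / se under a normal approximation."""
    z = abs(coef / se)
    return math.erfc(z / math.sqrt(2))


def stars(p):
    """Map a p-value to the star notation used in the table note."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.1:
        return "*"
    return ""


for name, (coef, se) in terms.items():
    p = two_sided_p(coef, se)
    print(f"{name}: t = {coef / se:.2f}, p = {p:.3f} {stars(p)}")
```

Under these assumptions, study_hours_per_day gets three stars, diet_qualityGood gets one, and age gets none, which matches the table.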

Summary

This is a very strong model: it explains about 90% of the variance in exam scores (R² = 0.902). The most important predictors of exam performance are:

📚 Study hours (positive)

📱 Social media and 📺 Netflix use (negative)

🛏️ Sleep (positive)

💪 Exercise and 😊 Mental health (positive)

🏫 Attendance (positive)
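To make the size of these effects concrete, the coefficients can be combined into a predicted change in exam score. The sketch below is illustrative Python, not the project's own code; the scenario (one more study hour and one less Netflix hour per day) and the helper name `predicted_change` are hypothetical. It describes the association estimated by the model with all other predictors held fixed, not a causal effect.

```python
# Coefficients copied from the Project 01 regression table
# (exam-score points per one-unit change in the predictor).
coef = {
    "study_hours_per_day": 9.575,
    "social_media_hours": -2.602,
    "netflix_hours": -2.282,
    "sleep_hours": 1.992,
}


def predicted_change(changes):
    """Sum of coefficient * change, holding all other predictors fixed."""
    return sum(coef[name] * delta for name, delta in changes.items())


# Hypothetical scenario: one more study hour and one less Netflix hour per day.
scenario = {"study_hours_per_day": 1.0, "netflix_hours": -1.0}
print(round(predicted_change(scenario), 3))
```

For this scenario the model predicts a change of 9.575 + 2.282 = 11.857 points.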


Project 02

In this project, I explored the California Housing data set to find the relationship between house price and proximity to the ocean. I explored the data set by creating plots and fitting a multiple linear regression. Please find the code and report in the project_02/ folder.

California Housing:

California housing

Linear Regression Output

Dependent variable: median_house_value

Term                        Coefficient         (Std. Error)
longitude                   -26,812.990***      (1,019.651)
latitude                    -25,482.190***      (1,004.702)
housing_median_age          1,072.520***        (43.886)
total_rooms                 -6.193***           (0.791)
total_bedrooms              100.556***          (6.869)
population                  -37.969***          (1.076)
households                  49.617***           (7.451)
median_income               39,259.570***       (338.005)
ocean_proximityINLAND       -39,284.300***      (1,744.258)
ocean_proximityISLAND       152,901.900***      (30,741.880)
ocean_proximityNEAR.BAY     -3,954.052**        (1,913.339)
ocean_proximityNEAR.OCEAN   4,278.134***        (1,569.525)
Constant                    -2,269,954.000***   (88,013.880)

Observations: 20,433
R²: 0.646
Adjusted R²: 0.646
Residual Std. Error: 68,656.950 (df = 20420)
F Statistic: 3,111.608*** (df = 12; 20420)
Note: *p<0.1; **p<0.05; ***p<0.01

Executive Summary (Linear Regression Analysis)

  1. The model fits reasonably well (R² ≈ 0.65).

  2. Most variables are statistically significant.

  3. median_income is the strongest positive predictor.

  4. Location features (longitude, latitude, ocean_proximity) are very important.

  5. Population and housing structure (rooms, households) affect value but may be entangled in multicollinearity [1].
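One common way to check for multicollinearity is the variance inflation factor (VIF). The sketch below is illustrative Python, not the project's own code. It covers only the simplest case of two predictors, where the VIF reduces to 1 / (1 - r²) with r the Pearson correlation between them. The numbers are made-up toy values, not rows from the California Housing data, and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)


# Toy values (assumed for illustration): rooms and households that rise together.
rooms = [800, 1200, 1500, 2100, 2600, 3400]
households = [150, 230, 260, 400, 470, 640]

print(f"r = {pearson_r(rooms, households):.3f}")
print(f"VIF = {vif_two_predictors(rooms, households):.1f}")
```

A VIF of 1 means the two predictors are uncorrelated. Values above roughly 5 to 10 are often treated as a warning sign, although that threshold is a rule of thumb rather than a strict cutoff. With more than two predictors, each VIF comes from regressing one predictor on all the others.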



Project 03

In this project, I explored different visualizations using geom_point(), geom_density(), and other ggplot geoms. The example shown here is a Tampa weather plot.

Sample data visualization:

Tampa Weather

Summary of Insights:

  • Hot months (Jun–Aug) have the highest temperatures and are often wet, especially June and July.

  • Cold months (Dec–Feb) have lower average temperatures and relatively less rainfall.

  • Transitional months (Mar–May, Sep–Nov) show mixed weather, with both dry and wet days.

Moving Forward

Next steps:

  • Keep exploring ggplot2, sf, and other visualization packages.

  • Work on visualizations for machine learning models.

  • Keep working on mapping visualizations for spatial data.


  1. Multicollinearity happens when two or more predictor variables in a regression model are highly correlated with each other. This means they contain overlapping information, which makes it hard for the model to determine which variable is actually influencing the outcome.